
ccl/sqlproxyccl: complete connection migration support in the forwarder #76805

Merged: 3 commits merged into cockroachdb:master from jaylim-crl:jay/connection-migration on Mar 13, 2022

Conversation

@jaylim-crl (Collaborator) commented Feb 19, 2022

ccl/sqlproxyccl: add last message details to the forwarder's processor

This commit adds last message details to the forwarder's processor, and in
particular, lastMessageType and lastMessageTransferredAt. The latter is
implemented using a simple logical clock, which will be used to determine
ordering of events. These two will be used during connection migration to
determine a safe transfer point.

At the same time, we plumb the connector and metrics into the processor, which
will be used in subsequent commits.
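For illustration, a minimal sketch of such a logical clock, assuming a shared atomic counter (makeLogicalClockFn is the name used later in this review; the body is illustrative):

package sqlproxyccl

import "sync/atomic"

// makeLogicalClockFn returns a function that hands out a new logical
// "timestamp" on every call. Later calls always observe strictly larger
// values, which is enough to order last-message events across processors.
func makeLogicalClockFn() func() uint64 {
	var counter uint64
	return func() uint64 {
		return atomic.AddUint64(&counter, 1)
	}
}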

Release justification: Low risk, sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: support waitResumed on the processor to block until resumption

Previously, there could be a race where suspend() was called right after
resuming the processors. If the processor goroutines had not yet started,
suspend() would return immediately, violating the invariant that the
processors must be suspended before proceeding. This commit adds a new
waitResumed method on the processor that allows callers to block until the
processors have resumed.
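A one-shot sketch of what waitResumed could look like, assuming the run loop closes a channel once it has started (the resumedC field is an assumption, and a real processor that suspends and resumes repeatedly would need to recreate it):

package sqlproxyccl

import "context"

type processor struct {
	resumedC chan struct{} // closed by the run loop once it has started
}

// waitResumed blocks until the processor's run loop is running, or until
// the context is done, whichever comes first.
func (p *processor) waitResumed(ctx context.Context) error {
	select {
	case <-p.resumedC:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}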

Release justification: sqlproxy-only change.

Release note: None

ccl/sqlproxyccl: complete connection migration support in the forwarder

Informs #76000.

This commit completes the connection migration feature in the forwarder
within sqlproxy. The idea is as described in the RFC.

A couple of new sqlproxy metrics have been added as well:

  • proxy.conn_migration.success
  • proxy.conn_migration.error_fatal
  • proxy.conn_migration.error_recoverable
  • proxy.conn_migration.attempted

For more details, see metrics.go in the sqlproxyccl package.
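As a rough sketch of how one transfer outcome could feed these counters (the struct and field names below are assumptions, not the actual metrics.go definitions):

package sqlproxyccl

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// metrics is a hypothetical subset of the proxy metrics; field names are
// guesses derived from the metric names above.
type metrics struct {
	ConnMigrationAttemptedCount        *metric.Counter
	ConnMigrationSuccessCount          *metric.Counter
	ConnMigrationErrorRecoverableCount *metric.Counter
	ConnMigrationErrorFatalCount       *metric.Counter
}

// recordTransferOutcome bumps attempted, then exactly one of the outcome
// counters. isConnRecoverableError is the sentinel check sketched later
// in this thread.
func (m *metrics) recordTransferOutcome(err error) {
	m.ConnMigrationAttemptedCount.Inc(1)
	switch {
	case err == nil:
		m.ConnMigrationSuccessCount.Inc(1)
	case isConnRecoverableError(err):
		m.ConnMigrationErrorRecoverableCount.Inc(1)
	default:
		m.ConnMigrationErrorFatalCount.Inc(1)
	}
}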

Release justification: This completes the first half of the connection
migration feature. This is low risk, as part of the code is guarded behind
the connection migration feature, which is not currently used in production.
In addition, CockroachCloud is the only user of sqlproxy.

Release note: None

@cockroach-teamcity (Member) commented

This change is Reviewable

@jaylim-crl (Collaborator, Author) left a comment

The structure for the connection migration work is there. Note that this is untested, and was mostly inspired by the initial prototype. Putting this up in case anyone has thoughts on the direction. I'll be adding tests next.

When it comes to the forwarder's API, I plan to have a Forwarder interface that implements just two methods: RequestTransfer and Close. With the current design, callers wouldn't know whether the transfer has failed or succeeded after calling RequestTransfer. I think that is fine for now, and if we really need to know the last transfer error, we could address that in the upcoming RFC. That said, the initial thought is to have the forwarder notify a "forwarder tracking" component (i.e. map of pods to forwarders) if it connects/disconnects from serverConn. The proxy can then check which pod the forwarder is assigned to (versus the previous pod) to know whether the transfer has succeeded.
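Sketched as code, the planned surface would be roughly the following (names follow the comment above, not necessarily the final API):

// Forwarder is the narrow interface the proxy would hold. Transfers are
// fire-and-forget: callers cannot observe whether one succeeded.
type Forwarder interface {
	// RequestTransfer requests that the connection be transferred to a
	// different SQL pod; it returns without waiting for the outcome.
	RequestTransfer()
	// Close terminates the forwarder and closes both connections.
	Close()
}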

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 52 at r1 (raw file):

)

var (

These error messages might be unnecessary, at least for the ones that are only used once. We could just return errors.New there if we wanted to.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 009f2d6 to e7730d1 on February 24, 2022 21:19
@jaylim-crl (Collaborator, Author) left a comment

TFTRs!

This draft PR is ready for another look. I added more tests, and changed how we handle timeout handlers. I also tested this, and it works as expected.

This test is still WIP, and is not included in this PR. Will update that later today. Here, we call RequestTransfer in a loop, and constantly query the SQL pod details. We can see that we alternate between pods, which implies that connection migration works:

=== RUN   TestConnectionMigration/successful_connection
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:39917/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}
{postgresql://[email protected]:36529/defaultdb?sslcert=test_certs%2Fclient.root.crt&sslkey=test_certs%2Fclient.root.key&sslmode=verify-full&sslrootcert=test_certs%2Fca.crt true}

There are three more items to address:

  1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
  2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
  3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 269 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Can we do this without atomics? I do not like using atomics for control flow. Getting atomic programming right is infamously difficult. In this case there is a race condition that occurs if RequestTransfer is called concurrently and it's not clear to me how to fix it. At a minimum CompareAndSwap needs to be used as the guard for this state transfer. But then there is a race with setting transferKey and transferCh.

Done - good catch re: concurrent calls. As discussed, I have moved this to use locks. My benchmarks show that using locks within a 30s workload run only slows the overall time by a few hundred ms, so I'd be fine with using locks everywhere instead of a mixture of both.
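For context, a lock-based guard along these lines sidesteps the CAS races described above, since the isTransferring check and any companion state (transferKey, transferCh) can be updated in one critical section. The field names below are assumptions:

package sqlproxyccl

import "sync"

type forwarder struct {
	mu struct {
		sync.Mutex
		isTransferring bool
	}
}

// tryStartTransfer flips the forwarder into the transferring state under
// the lock, returning false if a transfer is already in progress. Any
// companion state can be set in the same critical section, which is what
// the atomics-based version could not do safely.
func (f *forwarder) tryStartTransfer() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.mu.isTransferring {
		return false
	}
	f.mu.isTransferring = true
	return true
}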


pkg/ccl/sqlproxyccl/forwarder.go, line 440 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

idea: Instead of implementing the forwarding goroutines as a state machine, initiating the transfer could ask both goroutines to terminate. That would allow us to implement transfer the same way authenticate is implemented: as a single function and goroutine. Once the transfer is canceled or completed, the goroutine managing the transfer could restart the forwarding goroutines.

IMO the main advantage to implementing it by stopping the forwarding goroutines is we can keep the steady state code simple and we can group all of the transfer logic together.

As discussed, this would increase the overall transfer latency, since we'd have to wait for the server-to-client processor to stop before sending the transfer state message, so I kept the current approach. I like the idea since it makes things cleaner, but the tradeoff wouldn't be worth it.


pkg/ccl/sqlproxyccl/forwarder.go, line 514 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This logic can be simplified by context.WithTimeout.

transferCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
err := transfer(transferCtx)
if transferCtx.Err() != nil && ctx.Err() == nil {
	return errTransferTimeout
}
return err

Alternatively, if we do not specifically care about returning errTransferTimeout, the error from transfer can be returned directly.

Hm, this probably isn't relevant anymore since I moved it elsewhere.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e7730d1 to 6a8d610 on February 25, 2022 05:32
@jaylim-crl jaylim-crl changed the title [DNM] ccl/sqlproxyccl: add connection migration support to the forwarder ccl/sqlproxyccl: add connection migration support to the forwarder Feb 25, 2022
@jaylim-crl jaylim-crl changed the title ccl/sqlproxyccl: add connection migration support to the forwarder ccl/sqlproxyccl: add partial connection migration support to the forwarder Feb 25, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 6a8d610 to 87b8ea5 on February 25, 2022 05:33
@jaylim-crl jaylim-crl marked this pull request as ready for review February 25, 2022 05:35
@jaylim-crl jaylim-crl requested a review from a team as a code owner February 25, 2022 05:35
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 87b8ea5 to e3bb0d3 on February 25, 2022 06:21
@jaylim-crl (Collaborator, Author) left a comment

There are three more items to address:

1. Implement deserialize_session. Given that this PR is huge, I might just push that to a subsequent PR.
2. Perhaps move the changes in authentication.go in another PR. One thing that was overlooked is that the server still sends back initial connection messages in the token-based auth approach, so we'd have to slurp those messages.
3. Add TestConnectionMigration (which is an end-to-end test) to this PR.

(2) and (3) are done. I've decided to just leave them in this PR, but happy to break them down into separate PRs if reviewers prefer otherwise. (1) will definitely come in a later PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/forwarder.go, line 629 at r3 (raw file):

// NOTE: This should only be called during a transfer process. We assume that
// transferCloserCh has already been initialized.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {

The end-to-end transfer process works, but I'm not happy with this function. The unfortunate thing is that we unblock clients through context, but for servers, we'd do that by closing the forwarder. This goroutine is needed to ensure that we can terminate if we time out when sending the SHOW TRANSFER STATE query to the server. I may also have gone down a rabbit hole with all the transferConnRecoverable stuff, but I don't see a better way to address that. Happy to discuss.

closer is needed instead of relying on transferCtx.Done() to ensure that we can distinguish between a transfer timeout and an actual transfer completion. If we block on transferCtx.Done() in the client-to-server processor, then when the timeout handler is triggered, we will unblock automatically, which is incorrect.
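Roughly, the handler described above behaves like this sketch (transferCloserCh and Close are assumptions about the surrounding code):

package sqlproxyccl

import "time"

type forwarder struct {
	transferCloserCh chan struct{} // closed once the transfer finishes
}

// Close terminates the forwarder; closing the server connection unblocks
// any read pending on it.
func (f *forwarder) Close() {}

// runTransferTimeoutHandler force-closes the forwarder if the transfer
// does not finish within waitTimeout. Because transferCloserCh is closed
// on completion (success or failure), a finished transfer is never
// mistaken for a timeout.
func (f *forwarder) runTransferTimeoutHandler(waitTimeout time.Duration) {
	go func() {
		timer := time.NewTimer(waitTimeout)
		defer timer.Stop()
		select {
		case <-f.transferCloserCh:
			// Transfer finished in time; nothing to do.
		case <-timer.C:
			f.Close()
		}
	}()
}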

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 27, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
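With the cockroachdb/errors package, the marker pattern described here might look like this (the sentinel name comes from the commit message; the helper names are illustrative):

package sqlproxyccl

import "github.com/cockroachdb/errors"

// errConnRecoverableSentinel marks errors after which the connection is
// still usable, e.g. a serialization error surfaced by SHOW TRANSFER STATE.
var errConnRecoverableSentinel = errors.New("connection recoverable")

// markAsConnRecoverableError tags err with the recoverable sentinel.
func markAsConnRecoverableError(err error) error {
	return errors.Mark(err, errConnRecoverableSentinel)
}

// isConnRecoverableError reports whether err carries the sentinel.
func isConnRecoverableError(err error) bool {
	return errors.Is(err, errConnRecoverableSentinel)
}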
@jaylim-crl (Collaborator, Author) left a comment

FYI, I extracted some of the work here in #77109. I'll do the same for the auth stuff.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

Previously, we assumed that with token-based authentication, the server is
ready to accept queries the moment we connect to it. This assumption was
proved wrong during the integration tests with the forwarder: there is an
implicit AuthenticateOK step that happens after connecting to the server.
During that time, initial connection data such as ParameterStatus and
BackendKeyData messages will be sent to the client as well. For now, we will
ignore those messages. Once we start implementing query cancellation within
the proxy, that has to be updated to cache the new BackendKeyData.

Release note: None
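A sketch of the message-slurping step, assuming pgproto3-style message types (the exact API shape here is an assumption):

package sqlproxyccl

import "github.com/jackc/pgproto3/v2"

// slurpInitialMessages drains the server's initial connection data until
// ReadyForQuery, ignoring ParameterStatus and BackendKeyData for now.
// Once query cancellation is implemented, BackendKeyData must be cached
// rather than dropped.
func slurpInitialMessages(serverConn *pgproto3.Frontend) error {
	for {
		msg, err := serverConn.Receive()
		if err != nil {
			return err
		}
		switch msg.(type) {
		case *pgproto3.ParameterStatus, *pgproto3.BackendKeyData:
			// Ignore.
		case *pgproto3.ReadyForQuery:
			return nil
		}
	}
}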
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Feb 28, 2022
Informs cockroachdb#76000. Extracted from cockroachdb#76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

We also introduce a new error marker for errors that denote connections that
are recoverable. In our case, we convert SQL errors from the first column of
the SHOW TRANSFER STATE query (e.g. serialization errors) to Go errors and
mark them with the errConnRecoverableSentinel sentinel.

Release note: None
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from e3bb0d3 to 4357e45 on February 28, 2022 06:17
@jaylim-crl jaylim-crl changed the title ccl/sqlproxyccl: add partial connection migration support to the forwarder ccl/sqlproxyccl: complete connection migration support in the forwarder Feb 28, 2022
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from a095f21 to cd6e8dd on February 28, 2022 12:47
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 2fdda2d to fddfd30 on March 10, 2022 19:29
@jaylim-crl jaylim-crl requested a review from a team as a code owner March 10, 2022 19:29
@jaylim-crl jaylim-crl removed the request for review from a team March 10, 2022 19:30
@jaylim-crl (Collaborator, Author) left a comment

OK - this is ready for another round of review. Some tests are TODO, which will be addressed before merging, but I'm putting this up for a review first. Only look at the last three commits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch 2 times, most recently from 97f79e3 to 9661512 on March 10, 2022 21:02
@jeffswenson (Collaborator) left a comment

This is looking great! I have a few minor comments.

pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22:
}
}()

f.mu.Lock()
@jeffswenson (Collaborator) commented:

This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
 	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
    return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
	
return nil

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 9661512 to d6a7488 on March 11, 2022 16:50
@jaylim-crl jaylim-crl requested a review from jeffswenson March 11, 2022 16:51
@jaylim-crl (Collaborator, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I don't think logCtx is necessary. I looked at the implementation of log and I did not find any calls that check ctx.Err(). If it did, I would argue that is a bug. It is normal to log something that goes wrong during the cancellation process.

I didn't have this initially, but the process panics with a span-use-after-close error:

panic: use of Span after Finish. Span: proxy-con-serve. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

goroutine 2224 [running]:
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish(0xb7faeb8)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:185 +0x134
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).RecordingType(0xc0011d2900)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:387 +0x1e
github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).IsVerbose(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:395
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog({0x8f543d0, 0xc005301580})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:92 +0x3a
github.com/cockroachdb/cockroach/pkg/util/log.logfDepth({0x8f543d0, 0xc005301580}, 0x1, 0x1, 0x0, {0x7d9f468, 0x1f}, {0xc0051a7d60, 0x1, 0x1})
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/channels.go:55 +0x2a8
github.com/cockroachdb/cockroach/pkg/util/log.Infof(...)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/util/log/log_channels_generated.go:263
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer.func2()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:145 +0x1f8
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).runTransfer(0xc00517d200)
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/conn_migration.go:174 +0x5d7
github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer.func1()
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x1d
created by github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.(*forwarder).RequestTransfer
	/Users/jaylim/workspace/src/github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/forwarder.go:154 +0x5b
FAIL	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl	2.852s

pkg/ccl/sqlproxyccl/conn_migration.go, line 148 at r22 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

This mutex is held for most of the transfer process. The processors are thread safe, and there is the isTransferring state, so we can avoid most of this locking if we create helpers for retrieving and setting the processors.

The code as written is correct. I like to keep critical sections relatively short in an effort to reduce the probability of deadlocks and contention.

I'm thinking something like:

func (f *forwarder) getProcessors() (request processor, response processor) {
 	f.mu.Lock()
	defer f.mu.Unlock()
	return f.mu.request, f.mu.response
}

func (f *forwarder) replaceServerConn(newServerConn net.Conn) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Transfer was successful.
	clockFn := makeLogicalClockFn()
	f.mu.serverConn.Close()
	f.mu.serverConn = newServerConn
	f.mu.request = newProcessor(clockFn, f.mu.clientConn, f.mu.serverConn)
	f.mu.response = newProcessor(clockFn, f.mu.serverConn, f.mu.clientConn)
}

...

request, response := f.getProcessors()

// Suspend both processors before starting the transfer.
if err := request.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending request processor")
}
if err := response.suspend(ctx); err != nil {
    return errors.Wrap(err, "suspending response processor")
}

// Transfer the connection.
newServerConn, err := transferConnection(ctx, f.connector, f.mu.clientConn, f.mu.serverConn)
if err != nil {
    return errors.Wrap(err, "transferring connection")
}
f.replaceServerConn(newServerConn)
	
return nil

Done. Updated it per your suggestion, but also introduced getConns for the client and server connections.


pkg/ccl/sqlproxyccl/forwarder.go, line 190 at r18 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

😄 On a 100gbps ethernet line, sending the smallest possible pg message, with no protocol overhead, it would take 217 years of continuous transmission to overflow a 64-bit counter.

>>> line_rate = 100 * 2 ** 30 # 100 Gib/s, in bits per second
>>> message_size = 5 * 8 # minimum pg message size of 5 bytes, in bits
>>> msg_per_second = line_rate / message_size
>>> seconds_to_overflow = 2 ** 64 / msg_per_second
>>> years_to_overflow = seconds_to_overflow / (60 * 60 * 24 * 365)
>>> years_to_overflow
217.908031253171

😄


pkg/ccl/sqlproxyccl/forwarder.go, line 320 at r18 (raw file):

It would require passing peekErr and typ as arguments.

Doing this muddies the logic in exitPeek, and I'd rather keep it this way with the extra mutex lock+unlock. If this proves to be an issue, we could easily optimize it later. Doing it the current way also ensures that updating the last message details is tied to the ForwardMsg call, and not to exiting.


pkg/ccl/sqlproxyccl/forwarder.go, line 83 at r20 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Generally I would advise against acquiring multiple locks at the same time because it is a deadlock risk.

Done. Added a note about that. I think as long as we're aware of it, such risks can be minimized. Also, using a single lock for all three structs is going to cause contention between the request and response processors, especially when queries are pipelined.

@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from d6a7488 to 64166a9 on March 11, 2022 18:09
@jaylim-crl jaylim-crl force-pushed the jay/connection-migration branch from 64166a9 to 8610953 on March 11, 2022 18:28
@jeffswenson (Collaborator) left a comment

LGTM!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

I didn't have this initially, but the process panics with a span-use-after-close error (see the stack trace quoted above).

Jay and I dug into this offline. The root cause is that sqlproxy is using the stopper incorrectly. This will be fixed in a separate PR: #77689.


pkg/ccl/sqlproxyccl/forwarder.go, line 158 at r16 (raw file):

Previously, jaylim-crl (Jay Lim) wrote…

@jeffswenson, how strongly do you feel about guarding these variables with a mutex when they don't need one? I attempted to do so, but it was extremely difficult to read/follow, so I abandoned the idea.

This was resolved by the redesign.

@jaylim-crl (Collaborator, Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)


pkg/ccl/sqlproxyccl/conn_migration.go, line 125 at r22 (raw file):

The root cause is that sqlproxy is using the stopper incorrectly.

I guess it's more that we're not using the stopper consistently. Some places use stopper.RunAsyncTask, whereas others use the go keyword directly. The latter is handy, as we don't need to thread the stopper all the way down. Fixing all uses of the go keyword isn't straightforward, and we'd also have to rethink how we want traces to be displayed.
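For contrast, the two patterns being discussed, as a sketch (runTransfer stands in for whatever the goroutine does):

package sqlproxyccl

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

func spawnTransfer(ctx context.Context, stopper *stop.Stopper, runTransfer func(context.Context)) {
	// Tracked: the stopper waits for this task during shutdown, so the
	// task cannot outlive the resources (e.g. tracing spans) it uses.
	if err := stopper.RunAsyncTask(ctx, "conn-migration", runTransfer); err != nil {
		return // stopper is draining; do not start the transfer
	}

	// Untracked alternative: convenient, but the goroutine can outlive
	// its parent's tracing span, which is what produced the panic above.
	// go runTransfer(ctx)
}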

@jaylim-crl (Collaborator, Author) commented

bors r=JeffSwenson

@craig (Contributor) craig bot commented Mar 13, 2022

Build succeeded:

@craig craig bot merged commit dc5432d into cockroachdb:master Mar 13, 2022
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
Previously, we added support for connection migration in cockroachdb#76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 13, 2022
ccl/sqlproxyccl: add metric that records the transfer response message size

Informs cockroachdb#76000, and a follow-up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 14, 2022
Previously, we added support for connection migration in cockroachdb#76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Mar 14, 2022
ccl/sqlproxyccl: add metric that records the transfer response message size

Informs cockroachdb#76000, and a follow-up to cockroachdb#76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2022
77442: sqlproxyccl: Add codeUnavailable to the list of error codes r=alyshanjahani-crl a=alyshanjahani-crl

Release justification: fixes for high-priority bug in existing functionality

Previously, if a tenant cluster had maxPods set to 0, the error returned by
directory.EnsureTenantAddr was not treated as a non-retryable error.

The tenant directory implementation used in CockroachCloud now identifies
this situation and returns a status error with codes.FailedPrecondition and
a descriptive message.

This patch adds a check for the FailedPrecondition error in
connector.lookupAddr.

Release note: None

77626: sql: re-enable txnidcache by default r=maryliag,ajwerner a=Azhng

Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in #77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None

77675: cmd/roachtest: tweaks to sstable corruption test r=itsbilal a=nicktrav

A few minor changes to the existing roachtest:

Reduce the warehouse count on ingest back to 100 - this should be ample,
and it matches the warehouse count on the read path in a later part of
the test.

Change the stop options to wait until the process has been terminated
before continuing.

Re-work the command to search for SSTables to corrupt - use tables from
the most recent manifest (if multiple are present); consider SSTables
with either a start or end key representing a user table; shuffle tables
to distribute corruption over the LSM; perform filtering in the bash
command rather than in the test runner itself.

Release justification: Roachtest-only change.

Release note: None.

Touches #77321.

77700: ccl/sqlproxyccl: add connection migration-related metrics r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add metric that measures connection migration latency

Previously, we added support for connection migration in #76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. This metric is beneficial, as it tells
us how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

#### ccl/sqlproxyccl: add metric that records the transfer response message size

Informs #76000, and a follow-up to #76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

Co-authored-by: Alyshan Jahani <[email protected]>
Co-authored-by: Azhng <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Jay <[email protected]>